Abstract

We study and compare the learning dynamics of two universal learning 
algorithms, one based on Bayesian learning and the other on prediction with 
expert advice. Both approaches have strong asymptotic performance guaran- 
tees. When confronted with the task of finding good long-term strategies in 
repeated 2x2 matrix games, they behave quite differently. 



1 Introduction 

Today, Data Mining and Machine Learning are typically treated in a problem-specific
way: People propose algorithms to solve a particular problem (such as learning to
classify points in a vector space), they prove properties and performance guarantees
of their algorithms (e.g. for Support Vector Machines), and they evaluate the
algorithms on toy or real data, with the (potential) aim to use them afterwards in
real-world applications. In contrast, it seems that universal learning, i.e. a single
algorithm which is applied for all (or at least "many") problems, is neither feasible in
terms of computational costs nor competitive in (practical) performance. Nevertheless,
understanding universal learning is important: On the one hand, its practical
success would pave the way to Artificial Intelligence. On the other hand, principles
and ideas from universal learning can be of immediate use, and of course Machine
Learning research aims at exploring and establishing more and more general concepts
and algorithms.

Because of its practical restrictions, most of the understanding of universal
learning so far is theoretical. Some approaches which have been suggested in the
past are (adaptive) Levin search [Lev73, WS96], Optimal Ordered Problem Solver
[Sch02, Sch04] and Reinforcement Learning with split trees [Rin94, McC95], among
others. For a thorough discussion see e.g. [Hut04]. In this paper, we concentrate on
two approaches with very strong theoretical guarantees in the limit: the AIξ agent
based on Bayesian learning [Hut02] and FoE based on prediction with expert advice
[PH05].

Both models work in the setup of a sequential decision problem: An agent interacts
with an environment in discrete time t. At each time step, the agent performs
some action and receives feedback from the environment. The feedback consists
of a loss (or reward) plus maybe more information. (It is usually just a matter of
convenience if losses or rewards are considered, as one can be transformed into the
other by reversing the sign. Accordingly, in this paper we switch between both,
always preferring the more convenient one.) In addition to this instantaneous loss
(or reward), we will also consider the cumulative loss, which is the sum of the
instantaneous losses from t = 1 up to the current time step, and the average per round
loss, which is the cumulative loss divided by the total number of time steps so far.

Most learning theory known so far concentrates on passive problems, where our 
actions have an influence on the instantaneous loss, but not on the future behavior 
of the environment. All regression, classification, (standard) time-series prediction 
tasks, common Bayesian learning and prediction with expert advice, and many 
others fall in this category. In contrast, here we deal with active problems. The 
environment may be reactive, i.e. react to our actions, which is the standard situation 
considered in Reinforcement Learning. These cases are harder in theory, and it is 
often impossible to obtain relevant performance bounds in general. 

Both approaches we consider and compare are based on finite or countably
infinite base classes. In the Bayesian decision approach, the base class consists of
hypotheses or models for the environment. A model is a complete description of
the (possibly probabilistic) behavior of the environment. In order to prove guarantees,
it is usually assumed that the true environment is contained in the model
class. Experts algorithms, in contrast, work with a class of decision-makers or experts.
Performance guarantees are proven without any assumptions in the worst case, but
only relative to the best expert in the class. In both approaches, the model class
is endowed with a prior. If the model class is finite and contains n elements, it is
common to choose the uniform prior 1/n. For universal learning it turns out that
universal base classes for both approaches can be constructed from the set of all
programs on some fixed universal (prefix) Turing machine. Then each program naturally
corresponds to an element in the base class, and a prior weight is defined by
$w(\text{program}) = 2^{-\text{length}(\text{program})}$ (provided that the input tape of the Turing machine
is binary). The prior is a (sub-)probability distribution on the class, i.e. $\sum w \le 1$.
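
The prior weight $2^{-\text{length}(\text{program})}$ can be illustrated with a minimal sketch. The program strings below are hypothetical and only serve to show that, for a prefix-free set of binary programs, the weights form a (sub-)probability distribution.

```python
from fractions import Fraction

def is_prefix_free(programs):
    """Return True if no program is a proper prefix of another one."""
    return not any(p != q and q.startswith(p) for p in programs for q in programs)

def prior_weights(programs):
    """Assign w(program) = 2^(-length(program)) to each binary program string."""
    return {p: Fraction(1, 2 ** len(p)) for p in programs}

# Hypothetical prefix-free set of binary programs (illustration only).
programs = ["0", "10", "110", "1110"]
weights = prior_weights(programs)
total = sum(weights.values())  # at most 1 for any prefix-free set
print(total)
```

Exact fractions are used so that the comparison with 1 is not affected by rounding.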

Contents. The aim of this paper is to better understand the actual learning dynamics
and properties of the two universal approaches, which are both "universally
optimal" in a sense specified later. Clearly, the universal base class is computationally
very expensive or infeasible to use. So we will restrict ourselves to simpler base classes
which are "universal" in a much weaker sense: we will employ complete Markov base
classes where each element sees only the previous time step. Although these classes
are not truly universal, they are general enough (and not tailored towards our applications),
such that we expect the outcome to be a good indication for the dynamics
of true universal learning. The problems we study in this paper are 2x2 matrix
games. (Due to lack of space, we will not go into the deep literature on learning
equilibria in matrix games, as our primary interest is the universal learning dynamics.)
Matrix games are simple enough such that a "universal" algorithm with our
restricted base class can learn something, yet they provide interesting and nontrivial
cases for reactive environments, where really active learning is necessary. Moreover,
in this way we can set up a direct competition between the two universal learners.
The paper is structured as follows: In the next two sections, we present both
universal learning approaches together with their theoretical guarantees. Section 4
contains the simulations, followed by a discussion in Section 5.



2 Bayesian Sequential Decisions (AIξ)

Passive problems. Every inductive inference problem can be brought into the
following form: Given a string $x_{<t} = x_{1:t-1} := x_1 x_2 \ldots x_{t-1}$, guess its continuation $x_t$.
Here and in the following we assume that the symbols $x_t$ are in a finite alphabet $\mathcal{X}$;
for concreteness the reader may think of $\mathcal{X} = \{0, 1\}$. If strings are sampled from a
probability distribution $\mu : \mathcal{X}^* \to [0, 1]$, then predicting according to $\mu(x_t|x_{<t})$, the
probability conditioned on the history, is optimal. If $\mu$ is unknown, predictions may
be based on an approximation of $\mu$. This is what happens in Bayesian sequential
prediction: Let the model class $\mathcal{M} := \{\mu_1, \mu_2, \ldots\}$ be a finite or countable set of
distributions on strings $\mu_i(x_{1:t}|y_{1:t})$ which are additionally conditioned on the
past actions $y_{<t}$. The actions are necessary for dealing with sequential decision
problems as introduced above. We agree on the convention that the learner issues
action $y_t$ before seeing $x_t$. Let $\{w_1, w_2, \ldots\}$ be a prior on $\mathcal{M}$ satisfying $\sum_i w_i \le 1$.
Then the Bayes mixture is the weighted average

$$\xi(x_{1:t}|y_{1:t}) = \sum_i w_i\, \mu_i(x_{1:t}|y_{1:t}).$$

One can show that the $\xi$-predictions rapidly converge to the $\mu$-predictions almost
surely, if we assume that $\mathcal{M}$ contains the true distribution: $\mu \in \mathcal{M}$. This is not
a serious constraint if we include all computable probability distributions in $\mathcal{M}$.
This universal model class corresponds to all programs on a fixed universal Turing
machine (cf. the introduction and [Sol64, Hut04]).

In a passive prediction problem, the behavior of the environments $\mu_i$ does not
depend on our actions $y_{1:t}$. Here we may interpret our action as the prediction
of $x_t$. Assume that $\ell : (y_t, x_t) \mapsto [0, 1]$ is a function defining our instantaneous loss.
Then the average per round regret of $\xi$ tends to 0 at least at rate $t^{-1/2}$, precisely

Here, $L^\xi_{1:T}$ is the cumulative $\mu$-expected loss of the $\xi$-predictions. The $\xi$-prediction
(and likewise the $\mu$-prediction) is chosen Bayes optimal for the given loss function:
$y^\xi_t = \arg\min_{y_t} \sum_{x_t} \ell(y_t, x_t)\, \xi(x_{1:t}|y_{1:t})$. The difference $L^\xi_{1:T} - L^\mu_{1:T}$ is termed regret.
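
The following minimal sketch illustrates Bayesian sequential prediction in the passive case. It assumes a hypothetical finite model class of three Bernoulli sources with a uniform prior, and it ignores actions; neither the class nor the example history is taken from the paper.

```python
def bayes_mixture_predict(thetas, weights, history):
    """Return xi(x_t = 1 | history) and the posterior weights.

    Each model i is a Bernoulli source with P(x = 1) = thetas[i].
    """
    post = list(weights)
    for x in history:
        # Multiply each weight by the likelihood the model assigns to x.
        post = [w * (th if x == 1 else 1.0 - th) for w, th in zip(post, thetas)]
    norm = sum(post)
    post = [w / norm for w in post]
    p_one = sum(w * th for w, th in zip(post, thetas))
    return p_one, post

thetas = [0.2, 0.5, 0.8]          # hypothetical model class
weights = [1 / 3, 1 / 3, 1 / 3]   # uniform prior
history = [1, 1, 0, 1, 1, 1, 0, 1]
p_one, post = bayes_mixture_predict(thetas, weights, history)

# Bayes optimal prediction under 0/1 loss: the more probable symbol.
prediction = 1 if p_one > 0.5 else 0
```

After six ones and two zeros, most posterior weight sits on the model with parameter 0.8, and the mixture predicts the symbol 1.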

Active problems. If the environment is reactive, i.e. depends on our action, then 
it is easy to construct examples where the greedy Bayes optimal loss minimization 
is not optimal. Instead, the far-sighted AIξ agent chooses the action

$$y_t^{AI\xi} = \arg\min_{y_t} \sum_{x_t} \ldots \min_{y_{t+d}} \sum_{x_{t+d}} \ell_{t:t+d}\, \xi(x_{1:t+d}|y_{1:t+d}), \qquad (2)$$

where $\ell_{t:t+d} = \ell(y_{t:t+d}, x_{t:t+d}) = \sum_{s=t}^{t+d} \ell(y_s, x_s)$ and d is the depth of the expectimin
tree the agent computes by means of (2). We refer to t + d as the (current) horizon.
If we knew the final time T in advance and had enough computational resources, we
could choose d = T − t according to the fixed horizon T. Taking d fixed and small
(e.g. d = 8) is computationally feasible; this is the moving horizon variant. However,
this can cause consistency problems: A sequence of actions which is started at some
time step t may not seem favorable any more in the next time step t + 1 (since the
horizon shifts), and thus is disrupted. We therefore also use an almost consistent
horizon variant which takes d = 8 in the first step, then d = 7, and so on down to
d = 2, after which we start again with d = 8. (Actually, we do not go down to d = 1,
since then the agent would be greedy, which can for instance disrupt consecutive runs
of cooperation in the Prisoner's Dilemma, see below.) A theoretically very appealing
alternative is to consider the future discounted loss and infinite depth, which is a
solution of the Bellman equations. This can be found in [Hut04], together with more
discussion and the proof of the following optimality theorem for AIξ.

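Theorem 1 (Performance of AIξ) If there exists a self-optimizing policy p in
the sense that its expected average loss $\frac{1}{T} L^p_{1:T}$ converges for $T \to \infty$ to the optimal
average $\frac{1}{T} L^\mu_{1:T}$ for all environments $\mu \in \mathcal{M}$, then this also holds for the universal
policy $\xi$, i.e. $\frac{1}{T} L^\xi_{1:T} - \frac{1}{T} L^\mu_{1:T} \to 0$ for $T \to \infty$.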

Matrix games (as defined in Section 4) are straightforward in our setup. We just
have to consider that the opponent, i.e. the environment, does not know our action
$y_t$ when deciding its reaction $x_t$: $\mu_i(x_t|x_{<t},y_{1:t}) = \mu_i(x_t|x_{<t},y_{<t})$. AIξ for 2 x 2
matrix games can then be implemented recursively as shown in Figure 1, if we additionally
assume that the environments are Markov players with two internal states,
corresponding to the reaction $x_t$ they are playing. Since in step t, we don't know $x_t$
yet, AIξ must evaluate both AIXIrec(0, $x_{<t}$, $y_{<t}$, d) and AIXIrec(1, $x_{<t}$, $y_{<t}$, d) and
compute a weighted mixture for both possible actions a = 0, 1. As long as we do
not yet know the loss matrix $\ell : (y_t, x_t) \mapsto [0, 1]$ completely (which we assume to
be deterministic), we additionally compute an expectation over all assignments of
function $\ell = A(s_0, x_{<t}, y_{<t}, d)$
  $\ell^0 := \ell(0, s_0)$, $\ell^1 := \ell(1, s_0)$
  If $d > 1$ Then
    For $a \in \{0, 1\}$
      For $s \in \{0, 1\}$
        $\ell^{(s,a)} := A(s, [x_{<t} s_0], [y_{<t} a], d-1)$
        $\ell^a := \ell^a + \xi(s|s_0, a, x_{<t}, y_{<t}) \cdot \ell^{(s,a)}$
  Return $\min\{\ell^0, \ell^1\}$

Figure 1: The AIξ recursion for known loss matrix $\ell$.



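For $\tau = 1, 2, 3, \ldots$
  Sample $r_\tau \in \{0, 1\}$ independently s.t. $P[r_\tau = 1] = \gamma_\tau$
  If $r_\tau = 0$ Then
    Invoke subroutine FPL($\tau$):
      Sample $q^i_\tau \sim \mathrm{Exp}$ independently for $1 \le i \le n$
      Select $I^{FPL}_\tau = \arg\min_{1 \le i \le n} \{\eta_\tau \hat\ell^i_{<\tau} + \ln w^i - q^i_\tau\}$
    (end of subroutine FPL($\tau$))
    Play $I^{FoE}_\tau := I^{FPL}_\tau$ for $B_\tau$ elementary time steps
    Set $\hat\ell^i_\tau = 0$ for all $1 \le i \le n$
  Else
    Sample $I^{FoE}_\tau \in \{1 \ldots n\}$ uniformly
    Play $I := I^{FoE}_\tau$ for $B_\tau$ elementary time steps
    Let $t_0(\tau) := \sum_{\tau'=1}^{\tau-1} B_{\tau'}$ and $\ell^I_\tau := \sum_{t=t_0(\tau)+1}^{t_0(\tau+1)} \ell^I_t$
    Let $\hat\ell^I_\tau = \ell^I_\tau\, n/\gamma_\tau$ and $\hat\ell^i_\tau = 0$ for all $i \ne I$

Figure 2: The algorithm FoE. The parameters $\eta_\tau$, $\gamma_\tau$, and
$B_\tau$ will be specified in Theorem 2.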



losses which are consistent with the history. To this aim, we pre-define a finite
set $\mathcal{L} \subset \mathbb{Z}$ which contains all possible losses. In the simulations below, we use
$\mathcal{L} = \{0 \ldots 4\} \cup \{-16\}$, where the actual losses are always in $\{0 \ldots 4\}$ and the large
negative value of −16 encourages the agent to explore as long as it does not know
the losses completely. This is for obtaining interesting results with moderate tree
depth: otherwise, when the loss observed by AIξ is relatively low, AIξ would explore
only with a large depth. This phenomenon is explained in detail in Section 4.

Markov Decision Processes (MDPs) are probably the most intensively studied
of all possible environments. In an MDP, the environmental behavior depends
only on the last action and observation, precisely $\mu(x_t|x_{<t},y_{<t}) = \mu(x_t|x_{t-1},y_{t-1})$ in
case of a matrix game. For a 2 x 2 game, a Markov player is modelled by a 2 x 2 x 2
transition matrix. It turns out that the (uncountable) class of all transition matrices
with a uniform prior admits a closed-form solution:

$$\xi(x_t|x_{<t},y_{<t}) = \frac{N^{y_{t-1}}_{x_{t-1}x_t} + 1}{N^{y_{t-1}}_{x_{t-1}0} + N^{y_{t-1}}_{x_{t-1}1} + 2},$$

where $N^{y_{t-1}}_{x_{t-1}x_t}$ counts how often in the history the state $x_{t-1}$ transformed to state $x_t$
under action $y_{t-1}$. This is just Laplace's law of succession [Hut04, Prob.2.11&5.14].
(Observe that $\xi$ is not Markov but depends on the full history.) Note that the $\xi$ posterior
estimate changes along the expectimin tree (2). Disregarding this important
fact, as is done in Temporal Difference learning and variants, would result in greedy
policies which have to rescue exploration by ad-hoc methods (like ε-greedy). One can
show that there exist self-optimizing policies p for the class of ergodic MDPs [Hut04].
Although the class of transition matrices contains non-ergodic environments, a variant
of Theorem 1 applies, and hence the Bayes optimal policy is self-optimizing
for ergodic Markov players (which we will exclusively meet). The intuitive reason is
that the class is compact and the non-ergodic environments have measure zero.
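
The recursion of Figure 1, combined with the Laplace estimate and the almost consistent horizon schedule, can be sketched as follows. This is an illustration under stated assumptions, not the implementation used for the simulations: the loss matrix is hypothetical and assumed known, the history is represented only by transition counts, and the names `laplace`, `expectimin` and `almost_consistent_depths` are made up for this sketch.

```python
from itertools import islice

def laplace(counts, a, s0, s):
    """xi(s | s0, a): Laplace estimate from transition counts counts[a][s0][s]."""
    n = counts[a][s0]
    return (n[s] + 1) / (n[0] + n[1] + 2)

def expectimin(s0, counts, loss, d):
    """Minimal expected cumulative loss over d steps, opponent currently playing s0.

    loss[a][s] is the known loss for our action a against opponent move s.
    """
    values = [loss[a][s0] for a in (0, 1)]
    if d > 1:
        for a in (0, 1):
            for s in (0, 1):
                p = laplace(counts, a, s0, s)
                counts[a][s0][s] += 1      # posterior changes along the branch
                values[a] += p * expectimin(s, counts, loss, d - 1)
                counts[a][s0][s] -= 1      # restore counts for sibling branches
    return min(values)

def almost_consistent_depths(d_max=8, d_min=2):
    """Yield d_max, d_max - 1, ..., d_min and then start again."""
    while True:
        yield from range(d_max, d_min - 1, -1)

# Hypothetical loss matrix and empty history (all counts zero).
loss = [[1, 0], [2, 1]]
counts = [[[0, 0], [0, 0]], [[0, 0], [0, 0]]]
value = expectimin(0, counts, loss, 2)
depths = list(islice(almost_consistent_depths(), 9))
```

Updating the counts inside the tree reflects the remark that the posterior estimate changes along the expectimin tree. The cost grows like $4^d$, which is why only moderate depths are feasible.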

3 Acting with Expert Advice (FoE) 

Instead of predicting or acting optimally with respect to a model class, we may con- 
struct an agent from a class of base agents. We show how this can be accomplished 
for fully active problems. The resulting algorithm will radically differ from the AIξ
agent.

Prediction with expert advice has been very popular in the last two decades.
The base predictors are called experts. Our goal is to design a master algorithm
which in each time step t selects one expert i and follows its advice (i.e. predicts
as the expert does). Thereby, we want to keep the master's regret $\ell^{\text{master}}_{1:T} - \ell^*_{1:T}$
small, where $\ell^*_{1:T}$ is the cumulative loss of the best expert in hindsight at time
T. Usually, $T \ge 1$ is not known in advance. The state-of-the-art experts algorithms
achieve this: Loss bounds similar to (1) can be proven, with $L^\mu_{1:T}$ replaced by $\ell^*_{1:T}$ and
$w_\mu$ replaced by the prior weight of the best expert in hindsight, $w^*$. These bounds
hold in the worst case, i.e. without any assumption on the data generating process.
In particular, the environment which provides the feedback may be an adaptive
adversary. Since these bounds imply bounds in expectation in the Bayesian setting
(with slightly larger constants than (1)), expert advice is in a sense the stronger
prediction strategy.

In order to protect against adaptive adversaries, we need to randomize. In this
work, we build on the Follow the Perturbed Leader (FPL) algorithm introduced by
[Han57]. (For space constraints, we won't discuss the more popular alternative of
weighted sampling at all.) We don't even need to be told the true outcome after the
master's decision. All we need for the analysis is learning the losses of all experts,
which are bounded wlog. in [0, 1] (this is an important restriction which applies to
all standard experts algorithms). In this way, the master's actual decision is based
on the past cumulative loss of the experts. A key concept is that we must prevent
the master from learning too fast (or too slowly). This is achieved by introducing
a learning rate $\eta_t$, which decreases to zero at an appropriate rate with growing t.
Most of the literature assumes experts classes of finite size n with uniform prior 1/n, in
particular when the learning rate $\eta_t$ is non-stationary. For FPL, the case of arbitrary
non-uniform prior and countable expert classes has been treated in [HP04].^1

Active problems. In the passive full observation game discussed so far (i.e. we
learn all losses), the notion of regret is not problematic even against an adaptive
adversary.^2 However, the situation changes if the reaction of the environment depends
on our past actions. Consider the simple case of two experts, one always
suggesting action 0 and the other one action 1. The environment is reactive and
"unfair": Each expert incurs no loss as long as we stay with its initial action (e.g.
the action sequences 00000 and 111 have no loss). But as soon as we perform a
different action (e.g. 001), in all subsequent rounds both experts incur loss 1. Each
sensible strategy will soon explore both actions, and compared to the pure experts,
we incur large loss. Consequently, we need to consider a different notion of regret:
Our performance is compared to what an expert could achieve when it is actually
put in our situation. In this example, after the action sequence 001, we perform
badly, but so do all experts.
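
The "unfair" environment of this example can be written down in a few lines. The function name `unfair_losses` is hypothetical; the sketch only encodes the rule stated above, namely that the loss is 1 in every round after the first deviation from the initial action.

```python
def unfair_losses(actions):
    """Per-round losses in the "unfair" environment for a sequence of actions.

    The loss is 0 until we deviate from our initial action; in all rounds
    after the first deviation the loss is 1, whatever we (or any expert) do.
    """
    losses, deviated = [], False
    for a in actions:
        losses.append(1 if deviated else 0)
        if a != actions[0]:
            deviated = True
    return losses

pure_zero = unfair_losses([0, 0, 0, 0, 0])   # following expert 0 throughout
pure_one = unfair_losses([1, 1, 1])          # following expert 1 throughout
explorer = unfair_losses([0, 0, 1, 1, 0])    # tries both actions
```

A strategy that explores is penalized in every later round, while each pure expert on its own history is not, which is why regret has to be measured against experts put in our situation.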

Another problem with reactive environments is that we do not necessarily get
valid feedback for all experts in each round. In the previous example, if we chose
0 as the first action and learned that expert 0 had no loss at time t = 2, it is not
legitimate to make any assumption on the loss of expert 1 at t = 2. Even if the
environment tells us that the pure expert 1 had no loss, we are interested in the loss
of expert 1 put in our situation, i.e. after the first action 0. But this loss we do not
know. Precisely, we know only the loss of an expert with the correct action history
after the last time step in the past where we (maybe coincidentally) acted as it
suggested. Instead of trying to track the action history (which is possibly expensive),
we therefore use only the feedback from the currently selected expert i and discard all
other information. This is commonly referred to as the bandit setup. Fortunately, this
issue can be successfully addressed by forcing exploration, i.e. sampling according to
the prior, with a certain probability $\gamma_t$ [MB04]. This exploration rate $\gamma_t$ is decreased
to zero appropriately with growing t. Thus, in each time step we decide to either
follow the perturbed leader or explore. Accordingly, we call our algorithm FoE
(Follow or Explore). Bounds for the bandit setup are typically similar to (1), but

^1 Given the large amount of recent literature, it should be not too difficult to obtain similar
assertions for WS algorithms. However, as far as we know, the only result proven up to now
requires rapidly decaying weights [Gen03], which is therefore not appropriate for universal expert
classes.

^2 One can even prove the following strong statement [HP05, Pol05]: If a strategy performs well
against an oblivious adversary which does not at all depend on our actions, then it also performs
well against an adaptive adversary.
with $-\ln w^*$ replaced by (something larger than) $1/w^*$. Hence they are exponentially
larger in $w^*$, and one can show that this is sharp in general.

Increasing horizon. If the environment is reactive, it is not sufficient to consider 
only the short-term performance of the selected expert i. This was first recognized 
by [dFM04], who considered the repeated game of "Prisoner's Dilemma" and the 
"tit for tat" opponent as a motivating example (see Section 4 for details). In this 
case, a good long-term strategy is cooperating (because of the particular opponent).
However, defecting is dominant, i.e. the instantaneous loss of defecting is always
smaller than that of cooperating. So in order to notice that an expert (for instance
the always cooperating one) performs well, we have to evaluate it at least over two
time steps. In general, if we evaluate a chosen expert over an increasing number of
time steps, we hope that we perform well in arbitrary reactive environments. This
means that the master works at a different time scale $\tau$: in its $\tau$th time step, it
gives the control to the selected expert for $B_\tau \ge 1$ time steps (in the original time
scale t). As a consequence, the instantaneous losses which the master observes are
no longer uniformly bounded in [0, 1], but in $[0, B_\tau]$. Fortunately, it turns out that
the analysis remains valid if $B_\tau$ grows unboundedly but slowly enough. Only the
convergence rate of the average master's loss to the optimum is affected: we will
obtain a final rate of $t^{-1/10}$. The resulting algorithm FoE (for a finite expert class) is
specified in Figure 2 together with its subroutine FPL. We may have instantaneous
and cumulative losses in both time scales; this is always clear from the notation (e.g.
$\ell_t$ vs. $\ell_\tau$). Not surprisingly, most of FoE works in the master time scale.
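
The structure of FoE with a uniform prior can be sketched as below. This is a simplified illustration, not the authors' implementation: the function `foe`, the callback `play` and the toy environment are hypothetical, and the three schedules are placeholders supplied by the caller rather than the values of Theorem 2.

```python
import random

def foe(play, n, rounds, gamma, eta, block, rng):
    """Follow or Explore for n experts with uniform prior.

    play(i, steps) returns the loss of following expert i for `steps` steps;
    gamma, eta and block map the master time tau to the exploration rate,
    the learning rate and the block length B_tau.
    """
    est = [0.0] * n          # cumulative loss estimates, one per expert
    total = 0.0              # loss actually incurred by the master
    for tau in range(1, rounds + 1):
        steps = block(tau)
        if rng.random() < gamma(tau):
            # Explore: pick an expert uniformly, learn an unbiased estimate.
            i = rng.randrange(n)
            loss = play(i, steps)
            est[i] += loss * n / gamma(tau)
        else:
            # Follow the perturbed leader; the uniform prior term is constant
            # over experts and therefore omitted from the arg min.
            q = [rng.expovariate(1.0) for _ in range(n)]
            i = min(range(n), key=lambda k: eta(tau) * est[k] - q[k])
            loss = play(i, steps)   # incurred, but not used for learning
        total += loss
    return total, est

# Placeholder schedules (assumptions, not the values of Theorem 2).
rng = random.Random(0)
total, est = foe(
    play=lambda i, steps: float(i * steps),   # toy: expert 0 never loses
    n=2, rounds=200,
    gamma=lambda tau: tau ** -0.25,
    eta=lambda tau: tau ** -0.75,
    block=lambda tau: 1 + int(tau ** 0.125),
    rng=rng,
)
```

Dividing the observed loss by the exploration probability $\gamma_\tau/n$ keeps the estimate unbiased with respect to the algorithm's own randomization, at the price of a large variance.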

Note that FoE makes use of its observation only if it decided to explore, i.e.
if $r_\tau = 1$. This seems an unnecessary waste of information. It is motivated
by the analysis, since FoE needs an unbiased loss estimate $\hat\ell$ (with respect to
FoE's randomization). We just chose the simplest way to guarantee this. For the
simulations, we concentrate on the following faster learning variant: approximate
the probability of the selected expert i (jointly for exploration and exploitation)
by a Monte-Carlo simulation. Then always learn a (close to) unbiased estimate,
namely the observed loss divided by this approximated probability. The analysis
of FoE works in the same way for this modification, however
not resulting in better bounds. On the other hand, we will see that modified FoE
learns faster.

In case of a non-uniform prior and possibly infinitely many experts, the exploration
must be according to the prior weights. This causes another problem: FoE's
loss estimates $\hat\ell$ need to be bounded, which forbids exploring experts with very small
prior weights. Hence we define for each expert i an entering time $T^i \ge 1$ (at the
master time scale). Then FoE (including its subroutine FPL) is modified such that
it uses only active experts from $\{i : \tau \ge T^i\}$. This guarantees additionally that we
have only a finite active set in each step, and the algorithm remains computationally
feasible.

Theorem 2 (Performance of FoE) Assume FoE acts in an online decision prob- 
lem with bounded instantaneous losses £\ G [0, 1]. Let the exploration rate be ^r = 
r ^Z"' and the learning rate be r]r = r '^1'^ . In case of uniform prior, choose Br —
|_r^/^J . In case of arbitrary prior let Br = [t^/^^J andT'^ \{w'')-^^^. Then m case 
of uniform prior, for all experts i and all T > 1 we have 

< + 0(n2T-i/i°), and 

< ^e\..T + 0{n'T-'/'')w.p.l-T-'. 

Consequently, \imsvLpj,^^^{if°^ — i\.rp) < a.s. For non-uniform prior, corre- 
sponding assertions hold with 0-terms replaced by 0(T~^/^° + (w*)~^^T~^). 

The proof of this main theorem on the performance of FoE can be found in 
[PH05]. Similar bounds hold for larger B^- < r*. These bounds are improvable 

[Pol05], and it is possible to prove any regret bound 0^(i)'^ + (log ^)T^~^^^, at the

cost of increasing c where £ — 0. In the simulations, we used Br = t°'^^ for faster 
learning. For playing 2x2 matrix games, we will use the class of all 16 deterministic
four-state Markov experts. That is, each expert consists of a lookup table with
the actions for each of the 4 possible combinations of moves in the last round. In the
first round, the expert plays uniformly at random. (Compare the standard results on
learning matrix games with expert advice by [FS99].)
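
The expert class can be enumerated directly. In this sketch a state is assumed to be the pair (own last move, opponent's last move), and the name `all_markov_experts` is hypothetical; the random first move is left out.

```python
from itertools import product

# The four possible combinations of moves in the last round.
STATES = list(product((0, 1), repeat=2))  # (own last move, opponent's last move)

def all_markov_experts():
    """All 2^4 = 16 lookup tables mapping the last round's moves to an action."""
    return [dict(zip(STATES, actions)) for actions in product((0, 1), repeat=4)]

experts = all_markov_experts()

# Example: the expert that repeats the opponent's last move (tit for tat).
tit_for_tat = {(mine, theirs): theirs for mine, theirs in STATES}
```

Each table has four entries with two possible actions each, which gives the 16 experts mentioned above; always-cooperate, always-defect and tit for tat are among them.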

4 Simulations 

As already indicated, it is our goal to explore and compare the performance of the 
two universal learning approaches presented so far, in particular for problems which 
are not solved by passive or greedy learners. To this aim, repeated 2x2 matrix 
games are well suited: 

• they are simple, such that (close to) universal learning is computationally 
feasible even with brute-force implementation; 

• they provide situations where optimal long-term behavior significantly differs 
from greedy behavior (e.g. Prisoner's Dilemma); 

• moreover, we can observe how universal learners can exploit potentially weak 
adversaries; 

• and finally, we can test the two universal learners against each other. 

We begin by describing the experimental setup and the universal learners. After 
that, we will discuss five 2x2 matrix games, presenting experimental results and 
highlighting their interesting aspects. 

Setup. A 2 x 2 matrix game consists of two matrices $R_1, R_2 \in \mathbb{R}^{2\times 2}$, the first one
containing rewards for the row player, the second one rewards for the column player.
(It does not cause any problem that for convenience, we have developed the theory
in terms of losses rather than rewards: one may be transformed into the other by
simply inverting the sign. So for the discussion of the results, we will keep the
rewards, as this is more standard in game theory.) A single game proceeds in the
following way: the first player chooses a row action $i \in \{0, 1\}$ and simultaneously
the second a column action $j \in \{0, 1\}$, both players without knowing the opponent's
move. Then reward $R_k(i,j)$ is paid to player k (k = 1, 2), and i and j are revealed
to both players. A repeated game consists of T single games. We chose T = 20000 if
at least one opponent is FoE (which has slow learning dynamics, as we will see), and
T = 100 for the fast learning AIξ (unless it is plotted in the same graph as FoE). If
at least one randomized player participates, the run is repeated 10 times, and usually
the average is shown. We will consider only symmetric games, where one player,
when put in the position of the other player (i.e. when exchanging $R_1$ and $R_2$), has
a symmetric strategy (maybe after exchanging the actions). We will meet precisely
three types of symmetry: in the "Matching Pennies" game, $R_1 = R_2$ after inverting the
action of the row player; in the "Battle of Sexes" game, $R_1 = R_2$ after inverting
both players' actions; and in all other games we have $R_1 = R_2^T$ (transpose). In these
latter games, we will call the action 0 "defect" and 1 "cooperate". All games we
consider have rewards in {0 ... 4}. The AIξ and FoE agents are used as specified in
the previous sections, with the classes of all two-state Markov environments and all
deterministic four-state Markov experts, respectively. For AIξ, we will concentrate
the presentation on the almost consistent horizon variant, since it always performs
better than the moving horizon variant. For FoE, we will concentrate on the faster
learning variant.

Prisoner's Dilemma. This dilemma is classical. The reward matrices are Ri = 
(03) and i?2 = RJ, with the following interpretation: The two players are accused 
of a crime they have committed together. They are being interrogated separately. 
Each player can either cooperate with the other player (don't tell the cops anything),
or defect (tell the cops everything but blame the colleague). The punishments
are according to the players' joint decision: if none of them gives evidence, both get
a minor sentence. If one gives evidence and the other one keeps quiet, the traitor
goes free, while the other gets a huge sentence. If both give evidence, then they both
get a significant sentence. (There is also an easier variant "Deadlock" which we do not
discuss.)

It is clear that giving evidence, i.e. defecting, is an instantaneously dominant 
action: regardless of what the opponent does, the immediate reward is always larger 
for defecting. However, if both players would agree to cooperate, this is the "social 
optimum" and guarantees the better long-term reward in the repeated game. A
well-known instance for this case is playing against the "tit for tat" strategy,
which cooperates in the first move and subsequently performs the action we did in
the previous move. Similar but harder to learn are "two tit for tat" and "three
tit for tat", which defect in the first move and cooperate only if we cooperated two
respectively three times in a row. Note that although "two tit for tat" and "three tit
for tat" are not in AIξ's model class, probabilistic versions of the strategies are: if the
probability of "adversary defected, I cooperate, then the adversary will cooperate in
the next round" is chosen correctly (namely | for 2-tit for tat and ^ 0.57 for 3-tit 
for tat), then the expected number of rounds I have to cooperate until the adversary 
will do so is 2 respectively 3. 
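
The opponents described above can be simulated with a short sketch. The reward matrix `R1` used here is an assumption chosen so that defecting is dominant and rewards lie in {0 ... 4}; it is not taken from the text. The helper names are hypothetical as well.

```python
# Assumed row-player rewards R1[mine][theirs]; action 0 = defect, 1 = cooperate.
R1 = [[1, 4], [0, 3]]

def tit_for_tat(my_history):
    """Cooperates first, then repeats our previous action."""
    return my_history[-1] if my_history else 1

def k_tit_for_tat(k):
    """Opponent that cooperates only if we cooperated in each of the last k rounds."""
    def move(my_history):
        return int(len(my_history) >= k and all(my_history[-k:]))
    return move

def total_reward(policy, opponent, rounds):
    """Cumulative reward of `policy` against `opponent` over `rounds` single games."""
    history, total = [], 0
    for _ in range(rounds):
        mine, theirs = policy(history), opponent(history)
        total += R1[mine][theirs]
        history.append(mine)
    return total

always_cooperate = lambda history: 1
always_defect = lambda history: 0

coop_vs_tft = total_reward(always_cooperate, tit_for_tat, 100)
defect_vs_tft = total_reward(always_defect, tit_for_tat, 100)
coop_vs_2tft = total_reward(always_cooperate, k_tit_for_tat(2), 100)
defect_vs_2tft = total_reward(always_defect, k_tit_for_tat(2), 100)
```

Under the assumed matrix, constant cooperation earns about three times as much as constant defection against these opponents, although defecting gives the larger immediate reward in every single round.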

Figure 4 shows that in most cases, AIξ learns the best actions very quickly.
(This is the consistent horizon variant; the moving horizon variant will be discussed
with the next game, Stag Hunt.) If the opponent is memoryless, as for example the
uniform random player, AIξ constantly defects after a short time. Against tit for tat
and two tit for tat, AIξ cooperates after a short time. The figure shows the average
per round rate of cooperation, which after a few exploratory moves converges to
the optimal action as $1/t$. However, AIξ does not learn to cooperate against three
tit for tat. The reason is the general problem that in order to increase exploration,
AIξ needs exponential depth of the expectimin tree. Assume that a certain action
sequence of length n is favorable against the true environment, which however does
not have too high a current weight. In this instance, cooperating three times in a row is
favorable against (the probabilistic version of) 3-tit for tat. In order to recognize that
this is worth exploring, AIξ has to build a branch of depth n = 3 in the expectimin
tree, which however has (because of the relatively low prior weight) very small probability
$\sim \exp(-n)$. Then it needs an exponentially large subtree below this branch
to accumulate enough (virtual) reward in order to encourage exploration.

One more problem arises when AIξ plays against another AIξ. Here, the perfectly
symmetric setting results in both playing the same actions in each move, hence they
are not learning correctly. We might try to remedy this by varying the tree depth
of the second AIξ (denoted AIξ2 in the figure); however, it turns out that in this
case, both AIξs do not learn at all to cooperate (see [Hut04, Sec. 8.5.2] for a possible
reason).

We now turn to the performance of FoE (the faster learning variant) as evaluated
in Figure 5. As expected, FoE learns much slower than AIξ (note the different time
scale). On the other hand, its exploration is strong enough to learn 3-tit for tat (and
even harder instances). When playing against another instance of FoE, we notice


[Figure 4: plot of the average per round rate of cooperation (vertical axis, up to 1) over 100 time steps. Curves: AIXI vs. random, AIXI vs. tit4tat, AIXI vs. 2-tit4tat, AIXI vs. 3-tit4tat, AIXI vs. AIXI, AIXI vs. AIXI2.]

Figure 5: FoE (and AIξ) in Prisoner's Dilemma



however that they usually do not succeed in overcoming the dominance of mutual
defection. Also when FoE competes with AIξ, they tend to learn mutual defection
rather than cooperation. (Sometimes, they learn cooperation in one or two of the
possible states of the MDP.)

Stag Hunt. This game is also known as "Assurance". The reward matrices are 
$R_1$ (a 2x2 matrix whose entries are illegible in this copy) and $R_2 = R_1^T$. Two 
players are hunting together. If they cooperate, they 
will catch the stag. However, one player might not trust the other, in which case he 
chases rabbits on his own instead. In this case, the other one won't get anything if 
he tries to cooperate. If both defect, then they are in conflict, and each player gets 
fewer rabbits. Although the optimum for both players is to cooperate, they need to 
trust each other sufficiently. If one player plays uniformly at random, it is better for 
the other to go for the rabbits. Also, defecting has the lower variance. 
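
The last two claims can be checked numerically. The sketch below uses a hypothetical payoff matrix `R1` whose values are assumed here so as to match the verbal description (stag together, nothing for a lone cooperator, fewer rabbits when both defect); they are not the values used in the experiments. It computes the mean and variance of each action's payoff against a uniformly random partner.

```python
from statistics import mean, pvariance

# Hypothetical Stag Hunt payoffs (assumed values, chosen to fit the description).
# Rows: own action (0 = cooperate, 1 = defect); columns: the other player's action.
R1 = [[4, 0],
      [3, 2]]


def payoff_stats_vs_uniform(row):
    """Mean and variance of the payoff when the other player picks each
    action with probability 1/2."""
    return mean(row), pvariance(row)


coop_mean, coop_var = payoff_stats_vs_uniform(R1[0])
defect_mean, defect_var = payoff_stats_vs_uniform(R1[1])

if __name__ == "__main__":
    print("cooperate:", coop_mean, coop_var)
    print("defect:   ", defect_mean, defect_var)
```

With these assumed numbers, defecting has the higher expected payoff against a random partner (2.5 versus 2) and the lower variance (0.25 versus 4), although mutual cooperation still yields the highest reward.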

It may be surprising to observe that AI^ (with a depth of 8) does not learn 
to cooperate against 2-tit for tat (Figure 6). The reason is that defecting has a 
relatively good payoff, and therefore exploration is not encouraged, as discussed 
in the previous subsection. If the depth of the tree is increased to 9, AI^ learns 
cooperation against 2-tit for tat (but not against 3-tit for tat). We also see that the 
moving horizon variant of AI^ has even more problems with exploration: It does 
not learn to cooperate against 2-tit for tat, even with depth 9. The explanation is 
that even if AI^ decides to explore in one time step, in the next step this exploration 
might not be correctly continued, as the tree is now explored to a different level. This 
observation can also be made for the Prisoner's Dilemma. In fact, the consistent 
horizon variant always performs better than the moving horizon variant. 

As before, FoE learns much more slowly but explores more robustly (Figure 7); neither 
2- nor 3-tit for tat is a problem. Unlike in the Prisoner's Dilemma, if AI^ and FoE 
are competing, they learn mutual cooperation in almost half of the cases; an average 
over such lucky instances is given in the figure. The same holds for FoE against 
FoE, while AI^ against AI^ has the same symmetry problem as already observed 
in the Prisoner's Dilemma. The original slower learning variant of FoE reaches the 
same average level of performance only after 10^ time steps instead of 2 • 10"^ steps, 
and moreover with a variance twice as high. 

Figure 7: Stag Hunt: FoE and its slower learning variant 



Chicken. The reward matrices $R_1$ (entries illegible in this copy) and $R_2 = R_1^T$ of the "Chicken" game 
(also known as "Hawk and Dove") can be interpreted as follows: Two coauthors 
write a paper, but each tries to spend as little effort as possible. If one succeeds in 
letting the other do all the work, he gets a high reward. On the other hand, if no one 
does anything, there will be no paper and thus no reward. Finally, if both decide to 
cooperate, both get some reward. Here, in the repeated game, it is socially optimal 
to take turns cooperating and defecting.^ Still, the best situation for one player is to 
emerge as the "dominant defector", defecting in most or all of the games, while 
the other one cooperates. 
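
The difference between taking turns, mutual cooperation and dominant defection can be illustrated by average per-round rewards. In the sketch below, only the mutual-cooperation reward of 2 and the "Easy Chicken" alternative of 3 are taken from the footnote; the remaining entries of the matrix and all names are assumptions chosen to be consistent with the description.

```python
# Hypothetical Chicken payoffs for player 1 (assumed, except the value 2 for
# mutual cooperation and the "Easy Chicken" value 3, which the footnote mentions).
# Index 0 = cooperate, 1 = defect; the matrix is read as R1[own][other].
def chicken_matrix(mutual_cooperation=2):
    return [[mutual_cooperation, 1],
            [4, 0]]


def average_rewards(R1, moves):
    """Average per-round reward of both players over a cycle of joint moves.
    The game is symmetric, so player 2 receives R1[b][a] for the joint move (a, b)."""
    r1 = sum(R1[a][b] for a, b in moves) / len(moves)
    r2 = sum(R1[b][a] for a, b in moves) / len(moves)
    return r1, r2


C, D = 0, 1
take_turns = [(C, D), (D, C)]   # players alternate who defects
always_cooperate = [(C, C)]
dominant_defector = [(D, C)]    # player 1 always defects, player 2 yields

if __name__ == "__main__":
    R1 = chicken_matrix()
    for name, moves in [("take turns", take_turns),
                        ("always cooperate", always_cooperate),
                        ("dominant defector", dominant_defector)]:
        print(name, average_rewards(R1, moves))
```

Under these assumptions, taking turns gives each player 2.5 per round, more than the 2 from mutual cooperation, while the dominant defector receives 4 and the other player only 1. Raising the mutual-cooperation reward to 3 makes constant cooperation better than taking turns, which matches the "Easy Chicken" remark.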

If the opponent steadily alternates between cooperating and defecting, then AI^ 
quickly learns to adapt. This can be observed in Figure 8, where the performance 
is given in terms of average per-round reward instead of cooperation rate. However, 
AI^ is not obstinate enough to perform well against a "stubborn" adversary 
that cooperates only after his opponent has defected for three successive time 
steps. Here, AI^ learns to cooperate, leaving his opponent the favorable role of the 
dominant defector. (However, AI^ learns to dominate the less stubborn adversary 
which cooperates after two defecting actions.) When two AI^s play against each 
other, they again have the symmetry problem. Interestingly, if we break symmetry 
by giving the second AI^ a depth of 9, he turns out to be the dominant defector (not 
shown in the graph). 

FoE behaves differently in this game (Figure 9). While he learns to deal with 
the steadily alternating adversary and emerges as the dominant defector against the 
stubborn one, he gives precedence to AI^ in most cases. This is not hard 
to explain, since FoE plays essentially at random in the beginning. Thus AI^ quickly 
learns to defect, and nothing remains for FoE but to learn to cooperate. However, 
this does not always happen: In a minority of the cases, FoE defects often enough 
that AI^ decides to cooperate, and FoE becomes the dominant defector. (Hence the 
average shown in the graph is less clearly in favor of AI^.) 

^We assume that the authors are not very good at cooperating, and that the costs of cooperating 
more than compensate for the synergy. We could assign a reward of 3 instead of 2 to mutual 
cooperation. This is the less interesting situation of "Easy Chicken", where cooperating is the 
optimal long-term strategy, as in the previous games. 

Figure 9: FoE (and AI^) in the Chicken game 

Figure 10: AI^ and FoE in Battle of Sexes 

Figure 11: AI^ and FoE in Matching Pennies 



Battle of Sexes. In this game, a married couple wants to spend the evening 
together, but they have not settled whether to go to the theater (her preference) or the 
pub (his preference). However, if they fail to meet, both have a boring evening (and 
no reward at all). The reward matrices are $R_1$ and $R_2$ (entries illegible in this copy). Coordination 
is clearly important in the repeated game. As in "Chicken", taking turns is a social 
optimum, while it is best for one player if his choice becomes dominant. 

In Figure 10, our universal learners perform similarly to the Chicken 
game. Both learn to deal with an alternating partner. FoE also learns to dominate 
a stubborn adversary who plays his less favored action only after the opponent 
has insisted on it three times. AI^ is dominated by this stubborn player. However, AI^ 
always dominates FoE. Finally, in contrast to the Chicken game, AI^ against AI^ 
does not have the symmetry problem; instead, both learn to alternate. 

Matching Pennies. Each player conceals in his palm a coin with either heads 
or tails up. They are revealed simultaneously. If they match (both heads or both 
tails), the first player wins, otherwise the second. This is the only zero-sum game of 
the games we consider, with reward matrices $R_1$ and $R_2$ (entries illegible in this copy). 
Thus, there is a minimax 
strategy for both players, which is actually uniform random play. On the other 
hand, deterministic repeated play is potentially exploitable by the adversary. 
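
The minimax property of uniform random play can be verified directly. The following sketch assumes a hypothetical zero-sum payoff matrix with +1 for a match and -1 otherwise (the exact values used in the experiments are not reproduced here) and evaluates the worst-case expected payoff of the matching player for a given probability of showing heads. The name `guaranteed_value` is introduced only for this illustration.

```python
# Hypothetical zero-sum payoffs (assumed): the row player, who tries to match,
# receives +1 on a match and -1 otherwise. Index 0 = heads, 1 = tails.
R1 = [[1, -1],
      [-1, 1]]


def guaranteed_value(p_heads):
    """Worst-case expected payoff of the row player who shows heads with
    probability p_heads, against the column player's best pure response."""
    expected = [p_heads * R1[0][col] + (1 - p_heads) * R1[1][col]
                for col in (0, 1)]
    return min(expected)


# Coarse grid search over mixing probabilities 0.00, 0.01, ..., 1.00.
best_p = max((i / 100 for i in range(101)), key=guaranteed_value)

if __name__ == "__main__":
    print("best mixing probability:", best_p)
    print("value of uniform play:", guaranteed_value(0.5))
    print("value of always heads:", guaranteed_value(1.0))
```

In this assumed setting the grid search selects p = 0.5, which guarantees an expected payoff of 0, whereas a deterministic choice can be exploited down to -1 by an adversary who predicts it.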

Figure 11 shows the results for this last game we present. Both AI^ and FoE 
learn to exploit a predictable adversary, namely the player alternating between 0 and 
1. The other games are balanced in the long run; only in the beginning does AI^ succeed 
in exploiting FoE a little. If two AI^s compete, it is important to break symmetry; then 
both learn to alternate (this situation is shown in the graph). If symmetry is not 
broken, the row player (who tries to match) always wins. 

5 Discussion 

Altogether, universal learners perform well in repeated matrix games. They usually 
learn to prefer the optimal long-term action to greedy behavior (Prisoner's Dilemma 
and Stag Hunt). If possible, they are able to exploit a predictable adversary (Matching 
Pennies). And they learn good strategies when it is necessary to foresee the 
opponent's action (Chicken and Battle of Sexes). Of the two approaches we presented 
and compared, AI^ learns much faster than FoE, but FoE explores more 
thoroughly. Of course, there is a trade-off between exploration and fast learning. 
Interestingly, it may depend on the adversary (and thus on the environment) whether fast 
learning or exploration is the better long-term strategy: In Chicken and Battle of 
Sexes, AI^ profits against FoE by learning fast and dictating its preferred action, 
but loses against the stubborn opponent because it does not explore enough. 